average_precision_score (Average Precision, AP)#
Average Precision (AP) summarizes a precision–recall (PR) curve into a single number. It is a ranking metric: it cares about how well you order examples by “how positive” they are.
When to reach for AP: imbalanced binary classification (fraud, rare disease, anomaly detection), information retrieval, recommender ranking.
Goals#
Build intuition for precision/recall and the PR curve
Derive AP and compute it by hand on a tiny example
Implement average_precision_score from scratch (NumPy)
Visualize what AP measures (Plotly)
Use AP for model selection / early stopping in a simple from-scratch logistic regression
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score as sk_average_precision_score
from sklearn.metrics import precision_recall_curve as sk_precision_recall_curve
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Why AP? (especially when positives are rare)#
In many real problems, the positive class is rare:
fraud vs non-fraud
diseased vs healthy
defective vs non-defective
If only 1% of examples are positive, a model that predicts “negative” for everyone gets 99% accuracy — yet it is useless.
A PR curve focuses on the positive class:
Precision answers: “Of the items I flagged, how many are truly positive?”
Recall answers: “Of all true positives, how many did I find?”
AP compresses the entire PR curve into one number (higher is better). The “no-skill” baseline is the positive prevalence \(\pi = \frac{\#\text{positives}}{n}\).
2) Precision/recall at a threshold#
Assume binary labels \(y_i \in \{0,1\}\) and model scores \(s_i \in \mathbb{R}\) (larger means “more positive”).
For a threshold \(\tau\), predict:

\[
\hat{y}_i(\tau) = \mathbb{1}[s_i \ge \tau].
\]
Define counts:
\(\mathrm{TP}(\tau)\): predicted 1 and actually 1
\(\mathrm{FP}(\tau)\): predicted 1 but actually 0
\(\mathrm{FN}(\tau)\): predicted 0 but actually 1
Then:

\[
\mathrm{Precision}(\tau) = \frac{\mathrm{TP}(\tau)}{\mathrm{TP}(\tau)+\mathrm{FP}(\tau)},
\qquad
\mathrm{Recall}(\tau) = \frac{\mathrm{TP}(\tau)}{\mathrm{TP}(\tau)+\mathrm{FN}(\tau)}.
\]
Sweeping \(\tau\) from high to low traces out the PR curve.
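To make the counting concrete, here is a minimal sketch at a single threshold (the labels and scores are illustrative, not from any dataset used later):

# Minimal sketch: precision/recall at one threshold (illustrative values)
tau = 0.5
y = np.array([1, 0, 1, 1, 0])
s = np.array([0.9, 0.6, 0.4, 0.8, 0.2])
y_hat = (s >= tau).astype(int)        # predict positive if score >= tau
tp = np.sum((y_hat == 1) & (y == 1))  # 2
fp = np.sum((y_hat == 1) & (y == 0))  # 1
fn = np.sum((y_hat == 0) & (y == 1))  # 1
tp / (tp + fp), tp / (tp + fn)        # precision = 2/3, recall = 2/3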
Ranking view (top-k)#
Instead of thinking in thresholds, you can sort examples by score (descending) and take the top-\(k\) as “predicted positive”. Let \(P@k\) be the precision among the top-\(k\) items, and \(R@k\) the recall achieved when the top-\(k\) items are predicted positive.
As \(k\) grows, recall is non-decreasing, and we trace the PR curve point-by-point.
# Tiny ranked example (8 items)
y_true_toy = np.array([1, 0, 1, 0, 1, 0, 0, 1])
y_score_toy = np.array([0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.40, 0.30])
order = np.argsort(-y_score_toy, kind="mergesort")
y_sorted = y_true_toy[order]
s_sorted = y_score_toy[order]
tp = np.cumsum(y_sorted)
fp = np.cumsum(1 - y_sorted)
precision_at_k = tp / (tp + fp)
recall_at_k = tp / tp[-1] # tp[-1] == number of positives
# AP = mean precision at every rank where we encounter a true positive
ap_toy = precision_at_k[y_sorted == 1].mean()
prevalence_toy = y_true_toy.mean()
df_toy = pd.DataFrame(
{
"rank": np.arange(1, len(y_true_toy) + 1),
"score": s_sorted,
"y_true": y_sorted,
"TP_cum": tp,
"FP_cum": fp,
"precision@k": precision_at_k,
"recall@k": recall_at_k,
"contributes_to_AP": (y_sorted == 1),
}
)
df_toy, ap_toy, prevalence_toy
( rank score y_true TP_cum FP_cum precision@k recall@k \
0 1 0.90 1 1 0 1.000000 0.25
1 2 0.85 0 1 1 0.500000 0.25
2 3 0.80 1 2 1 0.666667 0.50
3 4 0.70 0 2 2 0.500000 0.50
4 5 0.65 1 3 2 0.600000 0.75
5 6 0.60 0 3 3 0.500000 0.75
6 7 0.40 0 3 4 0.428571 0.75
7 8 0.30 1 4 4 0.500000 1.00
contributes_to_AP
0 True
1 False
2 True
3 False
4 True
5 False
6 False
7 True ,
0.6916666666666667,
0.5)
AP as “average precision at each hit”#
In a ranked list, AP has a convenient interpretation:

\[
\mathrm{AP} = \frac{1}{m} \sum_{k=1}^{n} y_{(k)}\, \mathrm{Precision}@k,
\]

where \(y_{(k)}\) is the label at rank \(k\), and \(m\) is the number of positives.
So every time you “hit” a true positive in the ranking, you record the current precision; AP is the average of those recorded precisions.
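Reading off the table above, the hits land at ranks 1, 3, 5, and 8, so AP can be checked by hand:

# Mean of precision@k at the four hit ranks (1, 3, 5, 8)
(1/1 + 2/3 + 3/5 + 4/8) / 4  # = 0.6917, matching ap_toy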
rank = np.arange(1, len(y_sorted) + 1)
pos_mask = y_sorted == 1
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=rank,
y=precision_at_k,
mode="lines+markers",
name="precision@k",
)
)
fig.add_trace(
go.Scatter(
x=rank[pos_mask],
y=precision_at_k[pos_mask],
mode="markers",
name="positions where y=1",
marker=dict(size=10, color="crimson"),
)
)
fig.add_hline(
y=ap_toy,
line_dash="dash",
line_color="crimson",
annotation_text=f"AP = {ap_toy:.3f}",
)
fig.update_layout(
title="Ranked list view: precision@k (AP is the mean at positive hits)",
xaxis_title="Rank k (top-k predicted positives)",
yaxis_title="Precision@k",
yaxis=dict(range=[0, 1.05]),
)
fig.show()
# Build a step PR curve from the ranked list
precision_curve = np.r_[1.0, precision_at_k]
recall_curve = np.r_[0.0, recall_at_k]
# Step-wise area under PR curve (this equals AP)
ap_area = np.sum(np.diff(recall_curve) * precision_curve[1:])
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=recall_curve,
y=precision_curve,
mode="lines",
line_shape="hv",
fill="tozeroy",
name="PR curve (step)",
)
)
fig.add_trace(
go.Scatter(
x=recall_curve[1:][pos_mask],
y=precision_curve[1:][pos_mask],
mode="markers",
marker=dict(size=10, color="crimson"),
name="points where recall increases (true positives)",
)
)
fig.add_trace(
go.Scatter(
x=[0, 1],
y=[prevalence_toy, prevalence_toy],
mode="lines",
line=dict(dash="dash", color="gray"),
name=f"baseline prevalence = {prevalence_toy:.3f}",
)
)
fig.update_layout(
title=f"Precision–Recall curve (toy example) — AP = {ap_toy:.3f} (area = {ap_area:.3f})",
xaxis_title="Recall",
yaxis_title="Precision",
xaxis=dict(range=[0, 1.02]),
yaxis=dict(range=[0, 1.02]),
)
fig.show()
ap_toy, ap_area
(0.6916666666666667, 0.6916666666666667)
3) Average Precision (formal definition)#
A PR curve is a set of points \((R(\tau), P(\tau))\) as we sweep the threshold \(\tau\).
In scikit-learn, average precision is computed as the area under the PR curve using a step function:

\[
\mathrm{AP} = \sum_{n} (R_n - R_{n-1})\, P_n,
\]

where the index \(n\) runs over the curve points in increasing recall.
This is equivalent to the ranked-list formula shown earlier (mean of \(\mathrm{Precision}@k\) at every true positive rank).
Two practical notes:
AP only depends on the ordering of scores. Any strictly increasing transform of scores (e.g., \(s \mapsto 10s + 3\)) leaves AP unchanged.
AP is not the same as trapezoidal “AUPRC”. If you need a specific integration convention, be explicit.
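To see that the two conventions genuinely differ, here is a quick comparison on the toy curve from earlier (a sketch; the trapezoidal area is computed explicitly rather than via a library call):

# Step-area AP (sklearn convention) vs trapezoidal AUPRC on the toy curve
ap_step = np.sum(np.diff(recall_curve) * precision_curve[1:])  # ≈ 0.692
auprc_trap = np.sum(
    np.diff(recall_curve) * (precision_curve[1:] + precision_curve[:-1]) / 2
)  # ≈ 0.649
ap_step, auprc_trap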
def precision_recall_curve_np(y_true, y_score):
"""Simple PR curve from scores.
Returns precision and recall in *increasing recall* order, with a starting point (recall=0, precision=1).
"""
y_true = np.asarray(y_true).astype(int)
y_score = np.asarray(y_score).astype(float)
if y_true.ndim != 1 or y_score.ndim != 1 or y_true.shape[0] != y_score.shape[0]:
raise ValueError("y_true and y_score must be 1D arrays of the same length")
order = np.argsort(-y_score, kind="mergesort")
y_sorted = y_true[order]
s_sorted = y_score[order]
n_pos = int(y_sorted.sum())
if n_pos == 0:
return np.array([1.0]), np.array([0.0]), np.array([])
tp = np.cumsum(y_sorted)
fp = np.cumsum(1 - y_sorted)
precision = tp / (tp + fp)
recall = tp / n_pos
precision = np.r_[1.0, precision]
recall = np.r_[0.0, recall]
# Thresholds corresponding to each added item (top-k): predict positive if score >= threshold
thresholds = s_sorted
return precision, recall, thresholds
def average_precision_score_np(y_true, y_score):
"""Average precision (AP) from scratch.
Equivalent to the mean of precision@k over ranks k where y_true=1 after sorting by score (descending).
"""
y_true = np.asarray(y_true).astype(int)
y_score = np.asarray(y_score).astype(float)
if y_true.ndim != 1 or y_score.ndim != 1 or y_true.shape[0] != y_score.shape[0]:
raise ValueError("y_true and y_score must be 1D arrays of the same length")
order = np.argsort(-y_score, kind="mergesort")
y_sorted = y_true[order]
n_pos = int(y_sorted.sum())
if n_pos == 0:
return 0.0
tp = np.cumsum(y_sorted)
precision_at_k = tp / (np.arange(len(y_sorted)) + 1)
ap = precision_at_k[y_sorted == 1].sum() / n_pos
return float(ap)
# Quick sanity checks vs scikit-learn
y_check = rng.integers(0, 2, size=50)
s_check = rng.normal(size=50)
ap_np = average_precision_score_np(y_check, s_check)
ap_sk = sk_average_precision_score(y_check, s_check)
# AP is invariant to strictly increasing transforms of the scores
ap_np_affine = average_precision_score_np(y_check, 10 * s_check + 3)
ap_np_cubic = average_precision_score_np(y_check, s_check**3)
ap_np, ap_sk, ap_np_affine, ap_np_cubic
(0.6068815234220586,
0.6068815234220585,
0.6068815234220586,
0.6068815234220586)
4) Visual intuition: random vs informative vs (almost) perfect scores#
Let’s keep the labels fixed and only change the quality of the scores. A random scoring function should have AP close to the prevalence \(\pi\). As the ranking improves, the PR curve bends upward and AP increases.
n = 3000
prevalence = 0.05
y_sim = (rng.random(n) < prevalence).astype(int)
scores_random = rng.normal(size=n)
scores_some_signal = 2.0 * y_sim + rng.normal(scale=1.0, size=n)
scores_almost_perfect = y_sim + 0.001 * rng.normal(size=n)
models = {
"random": scores_random,
"some signal": scores_some_signal,
"almost perfect": scores_almost_perfect,
}
fig = go.Figure()
for name, scores in models.items():
prec, rec, _ = precision_recall_curve_np(y_sim, scores)
ap = average_precision_score_np(y_sim, scores)
fig.add_trace(
go.Scatter(
x=rec,
y=prec,
mode="lines",
line_shape="hv",
name=f"{name} (AP={ap:.3f})",
)
)
fig.add_trace(
go.Scatter(
x=[0, 1],
y=[prevalence, prevalence],
mode="lines",
line=dict(dash="dash", color="gray"),
name=f"baseline prevalence={prevalence:.3f}",
)
)
fig.update_layout(
title="PR curves for different score qualities (same labels)",
xaxis_title="Recall",
yaxis_title="Precision",
xaxis=dict(range=[0, 1.02]),
yaxis=dict(range=[0, 1.02]),
)
fig.show()
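As a quick numerical check (a sketch reusing y_sim, n, and prevalence from the cell above), AP for repeated random scorings clusters near the prevalence:

# AP of random scores should concentrate near the prevalence (~0.05 here)
aps_random = [average_precision_score_np(y_sim, rng.normal(size=n)) for _ in range(20)]
float(np.mean(aps_random)), prevalence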
5) Practical usage (sklearn)#
For binary classification you should pass scores, not hard labels:
y_true: 0/1 labels
y_score: probability for the positive class, or a decision score (any real-valued ranking signal)
In multilabel settings, average_precision_score supports average={"micro","macro","weighted","samples"}.
Conceptually it computes AP per label (one-vs-rest) and then combines them.
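A minimal multilabel sketch (synthetic labels and scores; the variable names here are made up for illustration):

# Multilabel AP: macro vs micro averaging on synthetic data
Y_ml = rng.integers(0, 2, size=(300, 3))       # 3 binary labels per sample
S_ml = rng.normal(size=(300, 3)) + 1.5 * Y_ml  # scores correlated with labels
ap_macro = sk_average_precision_score(Y_ml, S_ml, average="macro")  # mean of per-label APs
ap_micro = sk_average_precision_score(Y_ml, S_ml, average="micro")  # pool all (label, score) pairs
ap_macro, ap_micro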
# scikit-learn comparison (binary)
ap_np = average_precision_score_np(y_sim, scores_some_signal)
ap_sk = sk_average_precision_score(y_sim, scores_some_signal)
prec_sk, rec_sk, thr_sk = sk_precision_recall_curve(y_sim, scores_some_signal)
# sklearn returns recall in decreasing order; reverse for plotting
prec_sk_inc = prec_sk[::-1]
rec_sk_inc = rec_sk[::-1]
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=rec_sk_inc,
y=prec_sk_inc,
mode="lines",
line_shape="hv",
name="sklearn precision_recall_curve",
)
)
fig.update_layout(
title=f"sklearn PR curve (AP={ap_sk:.3f})",
xaxis_title="Recall",
yaxis_title="Precision",
xaxis=dict(range=[0, 1.02]),
yaxis=dict(range=[0, 1.02]),
)
fig.show()
ap_np, ap_sk
(0.4679788743919764, 0.4679788743919763)
6) Using AP to optimize a simple model (from-scratch logistic regression)#
AP depends on sorting/ranking, so it is not nicely differentiable with respect to model parameters. In practice you usually:
train a model with a differentiable surrogate loss (e.g., log-loss),
select hyperparameters / stop training using a metric like AP on a validation set.
Below we train logistic regression from scratch with gradient descent, and use validation AP to pick:
the best epoch (early stopping)
an \(\ell_2\) regularization strength
# Synthetic imbalanced classification data
X, y = make_classification(
n_samples=5000,
n_features=6,
n_informative=4,
n_redundant=0,
n_clusters_per_class=2,
weights=[0.95, 0.05],
class_sep=1.2,
random_state=42,
)
X_train, X_tmp, y_train, y_tmp = train_test_split(
X, y, test_size=0.4, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)
# Standardize features using train statistics
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
sigma = np.where(sigma == 0, 1.0, sigma)
X_train_s = (X_train - mu) / sigma
X_val_s = (X_val - mu) / sigma
X_test_s = (X_test - mu) / sigma
y_train = y_train.astype(int)
y_val = y_val.astype(int)
y_test = y_test.astype(int)
(y_train.mean(), y_val.mean(), y_test.mean(), X_train_s.shape)
(0.05366666666666667, 0.054, 0.053, (3000, 6))
def sigmoid(z):
"""Numerically stable sigmoid."""
z = np.asarray(z)
out = np.empty_like(z, dtype=float)
pos = z >= 0
out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
exp_z = np.exp(z[~pos])
out[~pos] = exp_z / (1.0 + exp_z)
return out
def predict_proba_logreg(X, w, b):
return sigmoid(X @ w + b)
def log_loss(y_true, y_prob, l2=0.0, w=None, eps=1e-15):
y_true = np.asarray(y_true).astype(float)
y_prob = np.clip(np.asarray(y_prob).astype(float), eps, 1 - eps)
loss = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
if l2 and w is not None:
loss = loss + 0.5 * l2 * float(np.sum(w * w))
return float(loss)
def fit_logreg_gd(
X_train,
y_train,
X_val,
y_val,
lr=0.1,
l2=0.0,
n_epochs=800,
eval_every=10,
):
n_samples, n_features = X_train.shape
w = np.zeros(n_features)
b = 0.0
best = {"ap": -np.inf, "epoch": 0, "w": w.copy(), "b": b}
history = {"epoch": [], "train_loss": [], "val_ap": []}
for epoch in range(1, n_epochs + 1):
p = predict_proba_logreg(X_train, w, b)
err = p - y_train
grad_w = (X_train.T @ err) / n_samples + l2 * w
grad_b = float(np.mean(err))
w -= lr * grad_w
b -= lr * grad_b
if epoch == 1 or epoch % eval_every == 0:
train_loss = log_loss(y_train, predict_proba_logreg(X_train, w, b), l2=l2, w=w)
val_scores = predict_proba_logreg(X_val, w, b)
val_ap = average_precision_score_np(y_val, val_scores)
history["epoch"].append(epoch)
history["train_loss"].append(train_loss)
history["val_ap"].append(val_ap)
if val_ap > best["ap"]:
best = {"ap": val_ap, "epoch": epoch, "w": w.copy(), "b": b}
return best, pd.DataFrame(history)
# Train once and track validation AP (early stopping)
best_run, hist = fit_logreg_gd(
X_train_s, y_train, X_val_s, y_val, lr=0.1, l2=0.1, n_epochs=800, eval_every=10
)
best_run
{'ap': 0.32685299468607276,
'epoch': 1,
'w': array([0. , 0.0044, 0.0011, 0. , 0.0008, 0.005 ]),
'b': -0.04463333333333334}
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=hist["epoch"],
y=hist["train_loss"],
mode="lines",
name="train log-loss",
yaxis="y1",
)
)
fig.add_trace(
go.Scatter(
x=hist["epoch"],
y=hist["val_ap"],
mode="lines",
name="val AP",
yaxis="y2",
)
)
fig.add_vline(
x=best_run["epoch"],
line_dash="dash",
line_color="crimson",
annotation_text=f"best val AP @ epoch {best_run['epoch']}",
)
fig.update_layout(
title="Training curve: optimize log-loss, select by validation AP",
xaxis_title="Epoch",
yaxis=dict(title="Train log-loss"),
yaxis2=dict(title="Validation AP", overlaying="y", side="right", range=[0, 1.0]),
)
fig.show()
test_ap = average_precision_score_np(
y_test, predict_proba_logreg(X_test_s, best_run["w"], best_run["b"])
)
test_ap
0.3175857876688931
# Use validation AP to choose a hyperparameter (L2 regularization strength)
l2_grid = np.logspace(-4, 0, 7)
results = []
best_overall = None
for l2 in l2_grid:
best, _hist = fit_logreg_gd(
X_train_s, y_train, X_val_s, y_val, lr=0.1, l2=float(l2), n_epochs=800, eval_every=20
)
results.append({"l2": float(l2), "best_val_ap": best["ap"], "best_epoch": best["epoch"]})
if best_overall is None or best["ap"] > best_overall["best_val_ap"]:
best_overall = {"l2": float(l2), "best_val_ap": best["ap"], "best": best}
df_grid = pd.DataFrame(results).sort_values("l2")
df_grid
|   | l2 | best_val_ap | best_epoch |
|---|---|---|---|
| 0 | 0.000100 | 0.326853 | 1 |
| 1 | 0.000464 | 0.326853 | 1 |
| 2 | 0.002154 | 0.326853 | 1 |
| 3 | 0.010000 | 0.326853 | 1 |
| 4 | 0.046416 | 0.326853 | 1 |
| 5 | 0.215443 | 0.326853 | 1 |
| 6 | 1.000000 | 0.327212 | 760 |
fig = px.line(
df_grid,
x="l2",
y="best_val_ap",
markers=True,
log_x=True,
title="Validation AP vs L2 strength (pick the maximum)",
)
fig.update_layout(xaxis_title="L2 strength (lambda)", yaxis_title="Best validation AP")
fig.show()
best_l2 = best_overall["l2"]
w_star = best_overall["best"]["w"]
b_star = best_overall["best"]["b"]
test_ap_star = average_precision_score_np(y_test, predict_proba_logreg(X_test_s, w_star, b_star))
best_l2, best_overall["best_val_ap"], test_ap_star
(1.0, 0.3272119853443016, 0.3165096666872798)
7) Pros, cons, and common pitfalls#
Pros#
Good for imbalanced data: focuses on the positive class; baseline is the prevalence.
Threshold-free summary: evaluates the whole ranking, not just one operating point.
Ranking metric: invariant to any strictly increasing transform of scores.
Cons / pitfalls#
Depends on prevalence: AP values aren’t directly comparable across datasets with different base rates.
High variance with few positives: a handful of positives can swing AP a lot.
Not symmetric: swapping the positive/negative label changes AP.
Not differentiable in a simple way: optimizing AP directly usually requires surrogate losses (pairwise ranking, smooth approximations, etc.).
Not an operating-point metric: if you care about a specific constraint (e.g., precision ≥ 0.9), also inspect the PR curve or use precision@k / recall@k (see the sketch after this list).
Integration convention: AP (step function) differs from trapezoidal AUPRC; don’t mix them silently.
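For the operating-point caveat above, a minimal sketch (reusing y_sim and scores_some_signal from section 4) that picks the highest-recall threshold meeting a precision floor:

# Choose an operating threshold subject to precision >= 0.9
prec, rec, thr = sk_precision_recall_curve(y_sim, scores_some_signal)
feasible = np.where(prec[:-1] >= 0.9)[0]    # the final PR point has no threshold
if feasible.size:
    i = feasible[np.argmax(rec[feasible])]  # highest recall among feasible points
    print(f"threshold={thr[i]:.3f}  precision={prec[i]:.3f}  recall={rec[i]:.3f}")
else:
    print("no threshold reaches precision >= 0.9")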
Exercises#
Show numerically (with random data) that:
the step-area formula \(\sum_n (R_n - R_{n-1})P_n\) equals the “mean precision at true-positive ranks” formula.
Compare ROC AUC vs AP on a heavily imbalanced dataset. Create two models with similar ROC AUC but very different AP.
Multilabel: generate labels with different prevalences per label and compare average="macro" vs average="micro".
References#
scikit-learn API docs: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html
scikit-learn PR curve: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html